Section: New Results

Tweet processing

Participants: Éric Villemonte de La Clergerie, Djamé Seddah, Benoît Sagot.

In the context of the SoSweet and Parsiti ANR actions, we ran various experiments on large amounts of tweets.

In a first experiment, around 20 million tweets were normalized and then parsed with FRMG. A first observation was that the current level of pre-parsing normalization was not sufficient to ensure good parsing coverage with FRMG (around 67%, compared with around 93% on FTB journalistic texts); it also led to high parsing times because of correction strategies. Nevertheless, error mining was used to identify a first set of easy errors, and further developments are planned to track errors more closely related to segmentation and normalization. Clustering and word embeddings over lemmas, relying on the dependency parse trees, were also tried, again with only partially successful results because of the poor quality of the pre-parsing phases.
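To make the notions of parsing coverage and error mining more concrete, here is a much-simplified sketch in Python. It is not the error mining algorithm used with FRMG: it assumes that the "suspicion" of a word form is simply the share of sentences containing it that failed to parse, and the function names and the input format (pairs of tokens and a success flag) are hypothetical.

```python
from collections import Counter


def coverage(sentences):
    """Fraction of sentences that received a full parse.

    `sentences` is a list of (tokens, parsed_ok) pairs (assumed format).
    """
    sentences = list(sentences)
    return sum(1 for _, ok in sentences if ok) / len(sentences)


def suspicion_rates(sentences, min_count=2):
    """Rank word forms by the share of their sentences that failed to parse.

    Simplifying assumption: a form is suspicious if it occurs mostly in
    unparsable sentences. Forms seen in fewer than `min_count` sentences
    are ignored to limit noise.
    """
    total = Counter()
    failed = Counter()
    for tokens, parsed_ok in sentences:
        for form in set(tokens):  # count each form once per sentence
            total[form] += 1
            if not parsed_ok:
                failed[form] += 1
    rates = {
        form: failed[form] / n
        for form, n in total.items()
        if n >= min_count
    }
    # Most suspicious first; ties broken alphabetically.
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))
```

On tweets, forms at the top of such a ranking would typically be spelling variants or tokens that normalization and segmentation did not handle, which is the kind of easy error the experiment aimed to identify first.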

In a second experiment, we adapted our clustering (DepCluster) and word embedding (DepGlove) algorithms to take into account non-linguistic relations, such as the author-word relation (between an author and the words of her tweets). The algorithms were applied to raw tweets with only basic tokenization, and results were produced on a monthly basis over 18 months (2016/02 to 2017/08). Several tools, with a special focus on Cytoscape, were tried to visualize the results as networks, in order to identify and explain communities.
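As an illustration of the author-word relation, the following Python sketch counts (author, word) pairs per month from raw tweets. It only shows the kind of co-occurrence data such a relation yields; it is not the DepCluster or DepGlove implementation. The input format (author, ISO date, text), the whitespace tokenizer, and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Very basic tokenization (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def author_word_counts(tweets):
    """Count (author, word) pairs, grouped by month.

    `tweets` is an iterable of (author, date, text) triples, where `date`
    is a 'YYYY-MM-DD' string (assumed format). Returns a dict mapping
    'YYYY-MM' to a Counter of (author, word) pairs.
    """
    by_month = defaultdict(Counter)
    for author, date, text in tweets:
        month = date[:7]  # 'YYYY-MM'
        for word in tokenize(text):
            by_month[month][(author, word)] += 1
    return by_month
```

Each monthly counter can be read as a weighted bipartite graph between authors and words. Exported as an edge list (author, word, weight), it could serve as input to a clustering or embedding step, or be loaded into a network visualization tool.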